
[IR] Add llvm.masked.load.first.fault intrinsic#156470

Open
huntergr-arm wants to merge 7 commits into llvm:main from huntergr-arm:speculative-load-intrinsic

Conversation

@huntergr-arm
Collaborator

@huntergr-arm huntergr-arm commented Sep 2, 2025

In order to support loading from addresses which may not be valid at
runtime without generating faults, we introduce the first fault load
intrinsic. Loading with this intrinsic will only generate a fault for
invalid accesses on the first element of the vector. Any subsequent
fault will be suppressed and the corresponding data will be poison.

This PR contains target-independent scalarization of the intrinsic so
that generic codegen works.
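
The described behaviour can be sketched as a small model (editorial Python sketch, not part of the patch; the dict-backed `memory` and `MemoryError` are stand-ins for real address translation and hardware faults):

```python
def masked_load_first_fault(memory, base, mask):
    """Toy model of the first-fault load semantics (illustrative only).

    `memory` maps addresses to values; a missing address is treated as
    inaccessible. Returns (data, out_mask). A fault on the first lane is
    raised as a normal load would be; a fault on any later lane is
    suppressed, leaving that lane and all subsequent lanes inactive in
    the output mask and their data as poison.
    """
    n = len(mask)
    data = ["poison"] * n
    out_mask = [False] * n
    for i in range(n):
        if not mask[i]:
            data[i] = 0  # masked-off lanes are zero in the result
            continue
        if (base + i) not in memory:
            if i == 0:
                raise MemoryError("fault on first lane")
            break  # suppress the fault; remaining lanes stay inactive
        data[i] = memory[base + i]
        out_mask[i] = True
    return data, out_mask
```

For example, loading four lanes where only the first two addresses are accessible yields data for lanes 0-1 and an output mask of `<1, 1, 0, 0>`.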

Contributor

@david-arm david-arm left a comment


It might be a good idea to split this patch up into two parts:

  1. A patch to introduce the LangRef and intrinsic along with the scalarisation code (on by default for all targets).
  2. A follow-on patch to add the TTI hook and the codegen.

What do you think?

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s

define { <4 x i32>, <4 x i1> } @speculative_load_v4i32_neon(ptr %p, <4 x i1> %mask) {
Contributor


It might be worth having a separate test file for fixed-width? For fixed-width I think we want to have different RUN lines for normal SVE and streaming mode. For the latter we can't use NEON and so will have to fall back on SVE.

Collaborator Author


Done. I've only included fixed vector tests for now as we won't scalarize scalable vectors.

@llvmbot
Member

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-llvm-transforms
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-backend-aarch64

Author: Graham Hunter (huntergr-arm)

Changes

In order to support loading from addresses which may not be valid at
runtime without generating faults, we introduce the speculative load
intrinsic. Loading with this intrinsic will only generate a fault for
invalid accesses on the first element of the vector. Any subsequent
fault will be suppressed and the corresponding data will be poison.

This PR contains both target-independent and AArch64-specific codegen
for the intrinsic.


Full diff: https://github.com/llvm/llvm-project/pull/156470.diff

4 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+59)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+8)
  • (modified) llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp (+66)
  • (added) llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll (+79)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 1c6823be44dcb..92ea6554c4aba 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -27098,6 +27098,65 @@ The '``llvm.masked.compressstore``' intrinsic is designed for compressing data i
 
 Other targets may support this intrinsic differently, for example, by lowering it into a sequence of branches that guard scalar store operations.
 
+.. _int_mspecload:
+
+'``llvm.masked.speculative.load.*``' Intrinsics
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
+
+::
+
+      declare { <16 x float>, <16 x i1> } @llvm.masked.speculative.load.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
+      declare { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
+      ;; The data is a vector of pointers
+      declare { <8 x ptr>, <8 x i1> } @llvm.masked.speculative.load.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
+
+Overview:
+"""""""""
+
+Reads a vector from memory according to the provided mask, suppressing faults
+for any lane beyond the first. The mask holds a bit for each vector lane, and
+is used to prevent memory accesses to the masked-off lanes. Inactive lanes will
+be zero in the result vector.
+
+Returns the loaded data and a mask indicating which lanes are valid, which may
+not be the same as the input mask depending on whether the processor encountered
+a reason to avoid loading from that address.
+
+Arguments:
+""""""""""
+
+The first argument is the base pointer for the load. The second argument is the
+alignment of the source location. It must be a power of two constant integer
+value. The third argument, mask, is a vector of boolean values with the same
+number of elements as the return type.
+
+Semantics:
+""""""""""
+
+The '``llvm.masked.speculative.load``' intrinsic is similar to the
+'``llvm.masked.load``' intrinsic, in that it conditionally loads values from
+memory into a vector based on a mask. However, it allows loading from addresses
+which may not be entirely safe. If the memory corresponding to the first element
+of the vector is inaccessible, then a fault will be raised as normal. For all
+subsequent lanes faults will be suppressed and the corresponding bit in the
+output mask will be marked inactive. The remaining elements in the output mask
+after a suppressed fault will also be marked inactive. Elements with active bits
+in the input mask will be poison values if the corresponding bit is inactive in
+the output mask.
+
+Reasons for marking output elements inactive are processor dependent; it may be
+a genuine fault, e.g. if the range of the data being loaded spans a page
+boundary and the page at the higher address is not mapped. But a given
+processor may also mark elements as inactive for other reasons, such as a cache
+miss. Code using this intrinsic must take this into account and not assume that
+inactive lanes signal the end of accessible memory. If more data should be
+loaded based on the semantics of the user code, then the base pointer should be
+advanced to the address of the first inactive element and a new speculative load
+attempted.
 
 Memory Use Markers
 ------------------
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 4d59ee8676b9e..74fed62fcc1dd 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -2505,6 +2505,14 @@ def int_masked_compressstore:
             [IntrWriteMem, IntrArgMemOnly,
              NoCapture<ArgIndex<1>>]>;
 
+def int_masked_speculative_load
+    : DefaultAttrsIntrinsic<[llvm_anyvector_ty,
+                             LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+                            [llvm_anyptr_ty, llvm_i32_ty,
+                             LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
+                            [IntrReadMem, IntrArgMemOnly, ImmArg<ArgIndex<1>>,
+                             NoCapture<ArgIndex<0>>]>;
+
 def int_experimental_vector_compress:
     DefaultAttrsIntrinsic<[llvm_anyvector_ty],
               [LLVMMatchType<0>, LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>, LLVMMatchType<0>],
diff --git a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
index 146e7d1047dd0..531d79f89f7cd 100644
--- a/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
+++ b/llvm/lib/Transforms/Scalar/ScalarizeMaskedMemIntrin.cpp
@@ -1031,6 +1031,68 @@ static void scalarizeMaskedVectorHistogram(const DataLayout &DL, CallInst *CI,
   ModifiedDT = true;
 }
 
+static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
+                                           DomTreeUpdater *DTU,
+                                           bool &ModifiedDT) {
+  // For a target without speculative/first-faulting load support, we can't
+  // actually scalarize accesses for all lanes. However, lanes beyond the
+  // first may be considered inactive due to reasons beyond a fault, so for
+  // generic 'scalarization' we can just load the first lane (if the
+  // corresponding input mask bit is active), then mark all other lanes as
+  // inactive in the output mask and embed the first lane into a vector of
+  // poison.
+  Value *Ptr = CI->getArgOperand(0);
+  Value *Align = CI->getArgOperand(1);
+  Value *Mask = CI->getArgOperand(2);
+  StructType *RetTy = cast<StructType>(CI->getType());
+  VectorType *DataTy = cast<VectorType>(RetTy->getElementType(0));
+  VectorType *MaskTy = cast<VectorType>(RetTy->getElementType(1));
+  Type *ScalarTy = DataTy->getScalarType();
+
+  MaybeAlign AlignVal = cast<ConstantInt>(Align)->getMaybeAlignValue();
+
+  IRBuilder<> Builder(CI->getContext());
+  BasicBlock *IfBlock = CI->getParent();
+  Builder.SetInsertPoint(CI);
+  Builder.SetCurrentDebugLocation(CI->getDebugLoc());
+  Value *EmptyMask = Constant::getNullValue(MaskTy);
+  Value *PoisonData = PoisonValue::get(DataTy);
+
+  // FIXME: If the mask is a constant, we can skip the extract.
+  Value *FirstActive =
+      Builder.CreateExtractElement(Mask, 0ul, Twine("first.active"));
+  Instruction *ThenTerm =
+      SplitBlockAndInsertIfThen(FirstActive, CI,
+                                /*Unreachable=*/false,
+                                /*BranchWeights=*/nullptr, DTU);
+
+  BasicBlock *ThenBlock = ThenTerm->getParent();
+  ThenBlock->setName("speculative.load.first.lane");
+  Builder.SetInsertPoint(ThenBlock->getTerminator());
+  LoadInst *Load = Builder.CreateAlignedLoad(ScalarTy, Ptr, AlignVal);
+  Value *OneLaneData = Builder.CreateInsertElement(PoisonData, Load, 0ul);
+  Value *OneLaneMask = Builder.CreateInsertElement(
+      EmptyMask, Constant::getAllOnesValue(MaskTy->getElementType()), 0ul);
+
+  Builder.SetInsertPoint(CI);
+  PHINode *ResData = Builder.CreatePHI(DataTy, 2);
+  ResData->addIncoming(PoisonData, IfBlock);
+  ResData->addIncoming(OneLaneData, ThenBlock);
+  PHINode *ResMask = Builder.CreatePHI(MaskTy, 2);
+  ResMask->addIncoming(EmptyMask, IfBlock);
+  ResMask->addIncoming(OneLaneMask, ThenBlock);
+
+  Value *Result = PoisonValue::get(RetTy);
+  Result = Builder.CreateInsertValue(Result, ResData, 0ul);
+  Result = Builder.CreateInsertValue(Result, ResMask, 1ul);
+  if (CI->hasName())
+    Result->setName(CI->getName() + ".first.lane.only");
+  CI->getParent()->setName("speculative.result");
+  CI->replaceAllUsesWith(Result);
+  CI->eraseFromParent();
+  ModifiedDT = true;
+}
+
 static bool runImpl(Function &F, const TargetTransformInfo &TTI,
                     DominatorTree *DT) {
   std::optional<DomTreeUpdater> DTU;
@@ -1170,8 +1232,12 @@ static bool optimizeCallInst(CallInst *CI, bool &ModifiedDT,
       scalarizeMaskedCompressStore(DL, HasBranchDivergence, CI, DTU,
                                    ModifiedDT);
       return true;
+    case Intrinsic::masked_speculative_load: {
+      scalarizeMaskedSpeculativeLoad(DL, CI, DTU, ModifiedDT);
+      return true;
     }
   }
+  }
 
   return false;
 }
diff --git a/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
new file mode 100644
index 0000000000000..e8facb8a5683b
--- /dev/null
+++ b/llvm/test/CodeGen/AArch64/masked-speculative-load-fixed-vectors.ll
@@ -0,0 +1,79 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s --check-prefixes=NEON
+; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s --check-prefixes=SVE
+
+define { <4 x i32>, <4 x i1> } @speculative_load_v4i32(ptr %p, <4 x i1> %mask) {
+; NEON-LABEL: speculative_load_v4i32:
+; NEON:       // %bb.0:
+; NEON-NEXT:    // kill: def $d0 killed $d0 def $q0
+; NEON-NEXT:    umov w8, v0.h[0]
+; NEON-NEXT:    tbz w8, #0, .LBB0_2
+; NEON-NEXT:  // %bb.1: // %speculative.load.first.lane
+; NEON-NEXT:    adrp x8, .LCPI0_0
+; NEON-NEXT:    ldr s0, [x0]
+; NEON-NEXT:    ldr d1, [x8, :lo12:.LCPI0_0]
+; NEON-NEXT:    // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT:    ret
+; NEON-NEXT:  .LBB0_2:
+; NEON-NEXT:    movi v1.2d, #0000000000000000
+; NEON-NEXT:    // implicit-def: $q0
+; NEON-NEXT:    // kill: def $d1 killed $d1 killed $q1
+; NEON-NEXT:    ret
+;
+; SVE-LABEL: speculative_load_v4i32:
+; SVE:       // %bb.0:
+; SVE-NEXT:    // kill: def $d0 killed $d0 def $q0
+; SVE-NEXT:    umov w8, v0.h[0]
+; SVE-NEXT:    tbz w8, #0, .LBB0_2
+; SVE-NEXT:  // %bb.1: // %speculative.load.first.lane
+; SVE-NEXT:    adrp x8, .LCPI0_0
+; SVE-NEXT:    ldr s0, [x0]
+; SVE-NEXT:    ldr d1, [x8, :lo12:.LCPI0_0]
+; SVE-NEXT:    // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT:    ret
+; SVE-NEXT:  .LBB0_2:
+; SVE-NEXT:    movi v1.2d, #0000000000000000
+; SVE-NEXT:    // implicit-def: $q0
+; SVE-NEXT:    // kill: def $d1 killed $d1 killed $q1
+; SVE-NEXT:    ret
+  %res = call { <4 x i32>, <4 x i1> } @llvm.masked.speculative.load.v4i32.p0(ptr %p, i32 16, <4 x i1> %mask)
+  ret { <4 x i32>, <4 x i1> } %res
+}
+
+;; FIXME: If we know the input mask is all-true and the vector is fully aligned,
+;;        we should be able to use a normal NEON load here.
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_fully_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; NEON:       // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT:    adrp x8, .LCPI1_0
+; NEON-NEXT:    ldr d0, [x0]
+; NEON-NEXT:    ldr d1, [x8, :lo12:.LCPI1_0]
+; NEON-NEXT:    ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_fully_aligned:
+; SVE:       // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT:    ldr d0, [x0]
+; SVE-NEXT:    index z1.s, #1, #-1
+; SVE-NEXT:    // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT:    ret
+  %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 16, <2 x i1> <i1 true, i1 true>)
+  ret { <2 x double>, <2 x i1> } %res
+}
+
+define { <2 x double>, <2 x i1> } @speculative_load_v2f64_all_true_partially_aligned(ptr %p) {
+; NEON-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; NEON:       // %bb.0: // %speculative.load.first.lane
+; NEON-NEXT:    adrp x8, .LCPI2_0
+; NEON-NEXT:    ldr d0, [x0]
+; NEON-NEXT:    ldr d1, [x8, :lo12:.LCPI2_0]
+; NEON-NEXT:    ret
+;
+; SVE-LABEL: speculative_load_v2f64_all_true_partially_aligned:
+; SVE:       // %bb.0: // %speculative.load.first.lane
+; SVE-NEXT:    ldr d0, [x0]
+; SVE-NEXT:    index z1.s, #1, #-1
+; SVE-NEXT:    // kill: def $d1 killed $d1 killed $z1
+; SVE-NEXT:    ret
+  %res = call { <2 x double>, <2 x i1> } @llvm.masked.speculative.load.v2f64.p0(ptr %p, i32 8, <2 x i1> <i1 true, i1 true>)
+  ret { <2 x double>, <2 x i1> } %res
+}

@huntergr-arm
Collaborator Author

It might be a good idea to split this patch up into two parts:

  1. A patch to introduce the LangRef and intrinsic along with the scalarisation code (on by default for all targets).
  2. A follow-on patch to add the TTI hook and the codegen.

What do you think?

Yep, that's fine. I'll use this PR for the intrinsic and raise the AArch64 PR after this lands.

@huntergr-arm huntergr-arm changed the title [IR][AArch64] Add llvm.masked.speculative.load intrinsic [IR] Add llvm.masked.speculative.load intrinsic Oct 30, 2025
static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
DomTreeUpdater *DTU,
bool &ModifiedDT) {
// For a target without speculative/first-faulting load support, we can't
Member


I think it'd be good to have an IR test with -passes=scalarize-masked-mem-intrin, similar to llvm/test/Transforms/ScalarizeMaskedMemIntrin/AArch64/streaming-compatible-expand-masked-gather-scatter.ll.

Collaborator Author


Done.

ModifiedDT = true;
}

static void scalarizeMaskedSpeculativeLoad(const DataLayout &DL, CallInst *CI,
Member


Other scalarize* functions have a doc comment with an example, I think that'd be worth adding here too.

Collaborator Author


Done.


Syntax:
"""""""
This is an overloaded intrinsic. The loaded data is a vector of any integer, floating-point or pointer data type.
Member


very nit: line wrap this

Value *EmptyMask = Constant::getNullValue(MaskTy);
Value *PoisonData = PoisonValue::get(DataTy);

// FIXME: If the mask is a constant, we can skip the extract.
Member


Is it worth handling this here? -passes=instcombine,simplifycfg would do the same.

Collaborator Author


Removed.

Comment on lines +27144 to +27145
of the vector is inaccessible, then a fault will be raised as normal. For all
subsequent lanes faults will be suppressed and the corresponding bit in the
Member


nit

Suggested change
of the vector is inaccessible, then a fault will be raised as normal. For all
subsequent lanes faults will be suppressed and the corresponding bit in the
of the vector is inaccessible, then a fault will be raised as normal. For all subsequent
lanes, if a fault occurs, it will be suppressed and the corresponding bit in the

the output mask.

Reasons for marking output elements inactive are processor dependent; it may be
a genuine fault, e.g. if the range of the data being loaded spans a page
Member


Worth saying a reason for marking lanes inactive could also be lack of support for suppressing faults?

@huntergr-arm huntergr-arm force-pushed the speculative-load-intrinsic branch from 364c8c4 to 7a8355f on February 10, 2026
@huntergr-arm huntergr-arm changed the title [IR] Add llvm.masked.speculative.load intrinsic [IR] Add llvm.masked.load.first.fault intrinsic Feb 10, 2026
@huntergr-arm
Collaborator Author

Changed to just focus on first-faulting loads, since there's already a .vp EVL-based intrinsic for this and #179642 should cover the page-based speculation.

@github-actions

github-actions bot commented Feb 10, 2026

🪟 Windows x64 Test Results

  • 132389 tests passed
  • 3009 tests skipped

✅ The build succeeded and all tests passed.

declare { <16 x float>, <16 x i1> } @llvm.masked.load.first.fault.v16f32.p0(ptr <ptr>, i32 <alignment>, <16 x i1> <mask>)
declare { <2 x double>, <2 x i1> } @llvm.masked.load.first.fault.v2f64.p0(ptr <ptr>, i32 <alignment>, <2 x i1> <mask>)
;; The data is a vector of pointers
declare { <8 x ptr>, <8 x i1> } @llvm.masked.load.first.fault.v8p0.p0(ptr <ptr>, i32 <alignment>, <8 x i1> <mask>)
Contributor


Do we need a passthru argument similar to the generic masked loads? cc @paulwalker-arm

Collaborator Author


I'm not sure we want a passthru for the other lanes, given that we will likely stop processing anything past a 'fault' when the load is used to determine early-exit conditions in the current vector iteration.

The equivalent evl intrinsic (llvm.vp.load.ff) doesn't have a passthru either.

Collaborator


I agree, defining the other lanes as poison seems fine. If users want to set them to a known value they can use a select instruction to set them.
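
That select fallback amounts to a per-lane choice between the loaded value and a passthru, keyed on the returned mask. As an editorial sketch (Python model, not LLVM IR):

```python
def apply_passthru(data, out_mask, passthru):
    """Per-lane select: keep the loaded value where the output mask is
    active, take the passthru value elsewhere (the poison lanes).
    Mirrors an IR `select` on the intrinsic's returned mask."""
    return [d if active else p
            for d, active, p in zip(data, out_mask, passthru)]
```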

Contributor

@fhahn fhahn left a comment


IIUC the new intrinsic has the same semantics as llvm.vp.load.ff, modulo the extra EVL argument?

If so, it would be nice if it would be possible to unify them if possible, not sure if we could have an optional EVL argument, which defaults to the number of elements of the vector type if omitted?

If that's not possible, it would be good to try to keep the names consistent (load.ff vs load.first.fault) and possibly share the wording for the semantics definition somehow, possibly referencing vp.load.ff

@huntergr-arm
Collaborator Author

huntergr-arm commented Feb 12, 2026

IIUC the new intrinsic has the same semantics as llvm.vp.load.ff, modulo the extra EVL argument?

If so, it would be nice if it would be possible to unify them if possible, not sure if we could have an optional EVL argument, which defaults to the number of elements of the vector type if omitted?

If that's not possible, it would be good to try to keep the names consistent (load.ff vs load.first.fault) and possibly share the wording for the semantics definition somehow, possibly referencing vp.load.ff

Sure, I can use .ff, I just figured it might be a little more obvious what it was if spelled out. Looks like I can also remove the alignment arg and just use metadata from the ptr arg.

It returns a mask instead of a new evl, but that mask would be a 'prefix' mask so could be synthesized from the returned evl.
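
That synthesis is simple enough to sketch (editorial Python model, not part of the patch; names are hypothetical):

```python
def prefix_mask_from_evl(evl, num_lanes):
    """Prefix mask equivalent to a returned EVL: the first `evl`
    lanes are active, the rest inactive."""
    return [i < evl for i in range(num_lanes)]

def evl_from_prefix_mask(mask):
    """Recover an EVL from a prefix mask by counting the leading
    active lanes."""
    evl = 0
    for bit in mask:
        if not bit:
            break
        evl += 1
    return evl
```

For example, an EVL of 2 over 4 lanes corresponds to the mask `<1, 1, 0, 0>`, and vice versa.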

As far as the overlap in functionality goes, it's pretty much the same as llvm.masked.load vs. llvm.vp.load, or store,gather,scatter,etc.

Would operand bundles be the way to make an argument optional?

At some point, Arm would probably also want to introduce a llvm.masked.gather.ff, as well as llvm.masked.load.nf no-fault version to enable interleaving; I don't think RVV has those operations (yet?).

@fhahn
Contributor

fhahn commented Feb 17, 2026

IIUC the new intrinsic has the same semantics as llvm.vp.load.ff, modulo the extra EVL argument?
If so, it would be nice if it would be possible to unify them if possible, not sure if we could have an optional EVL argument, which defaults to the number of elements of the vector type if omitted?
If that's not possible, it would be good to try to keep the names consistent (load.ff vs load.first.fault) and possibly share the wording for the semantics definition somehow, possibly referencing vp.load.ff

Sure, I can use .ff, I just figured it might be a little more obvious what it was spelled out. Looks like I can also remove the alignment arg and just use metadata from the ptr arg.

It returns a mask instead of a new evl, but that mask would be a 'prefix' mask so could be synthesized from the returned evl.

As far as the overlap in functionality goes, it's pretty much the same as llvm.masked.load vs. llvm.vp.load, or store,gather,scatter,etc.

Would operand bundles be the way to make an argument optional?

It would be possible to make it a variable-argument intrinsic, but unfortunately I don't think that's nice, as it would require spelling out the full function type for each call IIRC. Operand bundles may be a way, but I am not sure if they are super well supported everywhere.

Maybe for the new intrinsic as first step, it would be sufficient to share the description of the semantics, and link between them in the LangRef?

@huntergr-arm huntergr-arm force-pushed the speculative-load-intrinsic branch from 1bc1d81 to 15eded6 on February 17, 2026
@huntergr-arm
Collaborator Author

Renamed to use a .ff suffix, moved alignment to a pointer argument attribute, reduced semantics section to point to vp.load.ff and a brief explanation of the differences.

@huntergr-arm huntergr-arm force-pushed the speculative-load-intrinsic branch from 15eded6 to bed61d9 on March 17, 2026
Contributor

@lukel97 lukel97 left a comment


Thanks, I think this would help us with #151300 too since we would be able to do the EVL lowering the usual way by lowering to the masked version first, and then converting to the VP intrinsic.

Comment on lines +27854 to +27858
The '``llvm.masked.load.ff``' intrinsic is very similar to the
'``llvm.vp.load.ff``' intrinsic, with the differences being the lack of an EVL
parameter and the second returned value being a mask instead of an updated EVL
value.

Contributor


Given that VP intrinsics are usually defined as a "layer above" masked intrinsics, should we instead modify the documentation llvm.vp.load.ff to mention it's similar to llvm.masked.load.ff instead?

Comment on lines +27859 to +27860
If the processor suppresses a fault for any lane, then the returned mask will
indicate that lane and all subsequent lanes are inactive.
Contributor


Does this mean that the mask is a prefix mask? I.e. after the first inactive bit, every bit after will also be inactive. I think we will need to depend on that property in the loop vectorizer. Does SVE guarantee that?

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried to add expansion for vp.load.ff too in #154440 but I'm not sure if it's worth it.

The scalarization for the vp intrinsic works the same way as here where we rely on the fact that the intrinsic can arbitrarily reduce the number of non-faulted lanes to one.

But in that case I don't think it will ever be profitable to scalarize these intrinsics, because it means that a vectorized loop will always just be stepping through one element at a time. I.e. it's now a scalar loop.

I think we should just be returning an invalid cost if the target can't lower these.
